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Genetic regulation is a key component in development, but a clear understanding of the structure 
and dynamics of genetic networks is not yet at hand. In this work we investigate these properties 
within an artificial genome model originally introduced by Reil [l5|. We analyze statistical properties 
of randomly generated genomes both on the sequence- and network level, and show that this model 
correctly predicts the frequency of genes in genomes as found in experimental data. Using an 
evolutionary algorithm based on stabilizing selection for a phenotype, we show that robustness 
against single base mutations, as well as against random changes in initial network states that 
mimic stochastic fluctuations in environmental conditions, can emerge in parallel. Evolved genomes 
exhibit characteristic patterns on both sequence and network level. 



I. INTRODUCTION 

The transcription of DNA into mRNA and subsequent 
translation into protein is the fundamental genetic pro- 
cess; it is the crucial first step by which genetic informa- 
tion gives rise to an organism. Development is not such a 
linear process, however. By binding to specific regions of 
the genome, the protein produced by one gene can affect 
the activity of other genes, and those genes may in turn 
express proteins that enhance or inhibit still more genes. 
A network of interactions responsible for the regulation 
of genetic activity is thus defined. Such genetic regula- 
tion is important if cells are to have independent control 
over their behavior. 

Today, the available amount of data for regulatory in- 
teractions in a number of model organisms, as, for ex- 
ample, Yeast [l9| is steadily increasing. A number of 
distinguishing structural properties have been identified, 
namely scale- free degree distributions [13] , motifs @ and 
modular organization (l8| . 

Still, there is not enough information to suggest a com- 
prehensive theory of how genetic regulatory networks at- 
tain a particular structure, how genes in such networks 
interact and respond to perturbation, and how evolution 
has shaped these factors. This study is an attempt to 
explore these questions in the context of one particular 
model [15], in the hopes that it has features that cor- 
respond to the limited data currently available, and so 
that some progress toward a comprehensive theory might 
eventually be made. 

Traditionally, attempts to understand the character- 
istics of regulatory networks have focused on dynamical 
properties. That is, a network topology is specified and 
rules are applied to describe how each gene in the network 
responds to inputs. Some initial state is then assigned 
and the time evolution of gene activity is studied. A va- 
riety of rules have been used, including Boolean switches 



fill ], thresholds [TH, [r|> and differential equations [|[. 
Much less work has been done in understanding how 
the machinery of transcription, translation, and binding 
might act throughout the genome to produce the topol- 
ogy of a genetic network. In fact, most studies of genetic 
networks ignore modeling DNA-specific processes alto- 
gether Q . The first part of our study examines to what 
extent Reil's model [15j . which includes explicit parame- 
tcrizations for transcription and translation, can produce 
realistic genetic networks based on random genome real- 
izations. 

A description of the method we will use for building 
genetic regulatory networks follows, along with compar- 
isons to published and publically available experimen- 
tal data. Statistical properties of random realizations 
of artificial genomes are derived, and related to net- 
work structure. Next, we investigate the dynamics of 
our modeled networks when applying threshold dynam- 
ics to gene behavior. Although this is a strong simpli- 
fication, this type of discrete dynamics has been suc- 
cessfully applied in a number of studies that are con- 
cerned with the co-evolution of network dynamics and 
-structure @, HI, El- Finally, we are interested in un- 
derstanding the role evolution might play in selecting 
particular network topologies. This is explored by ask- 
ing how genome structure changes when those networks 
with certain dynamical properties are preferentially se- 
lected. Similar questions have been addressed in a small 
number of previous proof-of-principle studies using ar- 
tificial genomes [H l9l. Tl2j . however, without relating the 
observed adaptation to changes in sequence and network 
topology. In particular, we investigate a scenario of sta- 
bilizing selection similar to previous studies concerned 
with the evolution of developmental canalization [H . We 
find evolution towards robustness of regulatory dynamics 
against both noise, modeled as fluctuating initial condi- 
tions, and against mutations. We show that, in principle, 
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FIG. 1: Transcription, translation, and binding in an artifi- 
cial genome. The base promoter sequence is '01010' and is 
indicated by dark shading. Light shading shows genes and 
proteins. After [la |. 



this phenotypic robustness can be traced back to adap- 
tive changes on the sequence level that lead to emergence 
of more robust regulatory networks. 



complex interplay between various factors that do not 
only depend on sequence. In our study, we make the 
simplifying assumption that a TF is either activating or 
inhibiting, which is determined by the sum s g of its se- 
quence: if s g < (l/2)s max , where s max is the maximal 
possible cross sum value, it is inhibiting, otherwise it is 
activating. 

Clearly this model greatly simplifies the true transcrip- 
tion, translation, and binding processes. The binding of a 
transcription factor to a cis-site, for example, depends on 
the protein's structure, shape, and environment, rather 
than a simple template matching approach. Moreover, 
there is a stochastic element to all these processes that 
is simply ignored here. 

Although it represents a strong simplification, the 
model does have biological justification [l5|. The use 
of a base promoter sequence is reminiscent of the TATA 
box frequently found in eukaryotic organisms. Binding is 
modeled in a DNA-specific way, just as in real organisms. 
Additionally, the model has the potential for greater ex- 
tendability than some models (e.g., Boolean networks) 
because it includes DNA-specific transcription, transla- 
tion, and binding. The impact of single base pair mu- 
tations on gene function and network structure can be 
studied with this model, and also the effect of sequence 
duplications (resulting in gene duplication) or -deletions 
[lj|. In this paper, we will restrict ourselves to single 
base pair mutations, and keep the genome size constant. 



II. MODEL DETAILS 

A. Regulatory network construction from random 
sequences 

An artificial genome can be constructed as follows (also 
see Fig. [IJ . Randomly string together S integers drawn 
uniformly between and 3. The use of 4 digits need 
not be the case, but does provide correspondence with 
the ATGC alphabet of real genomes. For the purpose 
of generalization, the length of the alphabet in the arti- 
ficial genome may in principle take any positive integer 
value A. Next, define a base promoter sequence of length 
l p to indicate the position of genes in the genome, say 
'01010'. Wherever the promoter sequence occurs, the 
next l g digits are specified as a gene. Translation of the 
gene sequence into a protein occurs simply. Each num- 
ber in the sequence is incremented by 1 and any values 
greater than the last number in the base set of digits be- 
come the first number (e.g., the gene '012323' becomes 
the protein '123030'). Binding sites are determined by 
searching the genome for the protein sequence. If a match 
is found, then the protein is a transcription factor (TF) 
that binds to that site and that regulates the next down- 
stream gene. TFs may enhance or inhibit gene activity. 
In this study each TF has equal contribution to a gene's 
state and has equal probability of activating or suppress- 
ing gene expression. In real genetic systems, a TF may 
activate some genes and inhibit others, depending on a 



1. Regulatory dynamics 

Dynamics of state changes (activity or inhibition of 
genes) on the constructed networks can be defined in var- 
ious ways. In our study, we apply random threshold net- 
work (RTN) dynamics: An RTN consists of N randomly 
interconnected binary sites (spins) with states <7i = ±1. 
For each site i, its state at time t + 1 is a function of the 
inputs it receives from other spins at time t: 

ff<(f + l) = sgn(/s(t)) (1) 

with 

JV 

m= (*)+'*• ( 2 ) 

3=1 

The N network sites are updated synchronously. In the 
following discussion the threshold parameter h is set to 
zero. The interaction weights take discrete values 
Cij = +1 or —1 with equal probability. If i does not 
receive signals from j, one has = 0. 

For a finite system size N, the dynamics of RTN, which 
are closely related to Boolean networks [ll| converge to 
periodic attractors (limit cycles) after a finite number of 
updates. It has been suggested that different limit cycles 
may correspond to different gene expression states (cell 
types) This property of RTN is also advantageous 
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for defining phenotypes in artificial evolutionary scenar- 
ios that are subject to various kinds of selective pressure 



III. STATISTICAL PROPERTIES OF THE 
ARTIFICIAL GENOME 

In the following, N denotes the number of genes in 
the artificial genome. N is directly related to S, the 
number of bases, via the combinatorial construction of 
the artificial genome. 



A. Genome size scaling 



Under the assumption that l g <C A' p , one derives easily 
the following two relations: 



and 



(3) 



N(S,l p ) = [j-) S 
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S(N, l p ) = X lp ■ N (4) 

B. Length of binding regions 

1. Average length 

Constructing the artificial genome by random sampling 
from an alphabet of length A, the probability p p to get 
a promoter sequence after a given point of time is p v = 
A~' p . Thus the expectation value for the length of the 
binding regions in Reil's artificial genome is given by 



bhid , 



p P 



A< p - lo 



(5) 



2. Statistical distribution of hind 

To derive the exact statistical distribution of the 
lengths Ifnnd of the binding regions in the A.G., we 
first remark that the random production of promoter se- 
quences during the process of genome construction is a 
Bernoulli chain of length n with the two possible events: 
"0" - a certain base does not mark the begin of a pro- 
moter sequence and "I" - a certain base marks the start 
of a promoter sequence. Hence, the probability to pro- 
duce k promoters with n random sampling steps is given 
by 



P(k,n) = Q .p*(l-ft,)' 



-A' 



(6) 



FIG. 2: The probability of having K ou t regulatory outputs 
(a) and the probability of having Ki n regulatory inputs for 
random genomes with different gene lengths l g , averaged over 
10 4 realizations. 



By setting fc = 1 and n = Ibmd + lg + 1 we get the 
probability pi to produce one promoter within n = lbind+ 
l g + 1 sampling steps: 



Ibind lq ~\~ 1 \ , 
/'I = 1 x ) -Pp^-Pp) 



Ibind + lg 



(7) 



Only one of these possibilities is the correct one (produc- 
tion of a promoter with the last sampling step), i.e. for 
the derivation of p(lbind) we have to divide Eq. ([6]) by 
the binomial coefficient, leading to 



pQbind) = - p p ) lb '" d+l3 

= A~' P (I — X^py^i+lg 

= A-' p CX P [-a • {hind + lg)}, 



(8) 
(9) 



which is a decaying exponential distribution with a 
-ln(l-A-H 



C. Network connectivity 

In this section, we relate the previously derived proper- 
ties of the artificial genome to characteristic parameters 
of the resulting networks. 
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1. Average connectivity 

For a given TF, the probability to match to a random 
"DNA" sequence of length l g is given by pund = A~' 9 ■ 
There are (hind) — lg + 1 possibilities to bind to a binding 
region of average length (hind), thus the probability that 
the TF provides at least one input to the gene defined by 
the promoter sequence following a given binding region 
on average is 

(Pinput) = ((hind) -lg + 1) ■ A~' 9 

= (X 1 " -2l g + l) ■ A~' 9 . (10) 

Since we have N binding regions, the average connectiv- 
ity (averaged over the whole ensemble of possible random 
genomes) scales linear with the number of genes, 

(k) = (pinput) ■ N, (11) 

and the slope depends on A, l g and l p . 

Notice, however, that the average connectivity K ob- 
tained from a particular genome realization can substan- 
tially deviate from this typical value, since the possible 
values of K are Gaussian distributed around (k). 

2. In- and outdegree distribution 

From the above considerations, it is straight-forward to 
derive the statistical distributions for the number of ingo- 
ing and outgoing links in randomly constructed genomes. 
Since a TF has equal a priori probability to bind at any 
region of the base string, generation of out-links is a Pois- 
son process, and hence the outdegree distribution is a 
Poissonian (Fig. 2a): 

P(k out ) = &^e X p[-(K}}. (12) 

The number of inputs a gene receives, however, is propor- 
tional to the length Ui n d of its associated binding region, 
hence, it follows from Eq. 

P(kin) ~ exp [-pkm], (13) 

i.e. the indegree distribution is exponential. Both results 
are confirmed by numerical simulations (Fig. 2a and 2b) . 

D. Relevance to biology 

Clearly, random genome realizations ar far from being 
a realistic model of real biological genetic systems. How- 
ever, it can be shown that even this extreme oversimpli- 
fication has some relevance for biology. In Figure [3J the 
predicted number of genes in a genome, N — (l/4) ip • S, 
is plotted as a function of genome size for l p = 5. Ob- 
served data from 50 organisms that have been completely 
sequenced are also shown. The correspondence between 
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FIG. 3: The number of genes predicted from the model as a 
function of genome size S with l p — 5. The number of genes 
in 50 organisms is plotted for comparison. Observed data are 
taken from http://www.ultranet.com/~jkimball/ 



model and data is excellent for this range of S and shows 
that a combinatorial method for determining the number 
of genes in a genome is appropriate. For larger S, l p = 7 
is reasonable (not shown), but little observed data exists. 
On the other hand, statistical distributions of regulatory 
inputs and outputs do not match biological data particu- 
larly well; here, more realistic statistics can be obtained 
by constructing artificial genomes from duplication and 
divergence events jlij . However, even in these models 
the question how selection pressure on the phenotype, as 
encoded by network dynamics, may influence genome or- 
ganization, remains unanswered. This type of question 
shall be addressed in the remaining part of this paper. 



IV. STABILIZING SELECTION FOR A 
PHENOTYPE - AN EVOLUTIONARY 
SCENARIO 

Though evolved by the random processes of genetic 
drift and selection pressure from changing environments, 
real genetic systems are far from being random. Com- 
mon wisdom is that this is often due to the highly non- 
linear nature of the genotype-phenotype map, which in- 
cludes an intermediate layer of complex regulatory pro- 
cesses controlling cell machinery (unicellular organisms) 
or highly structured developmental processes (multicellu- 
lar organisms) . The multilevel-structure of the involved 
evolutionary processes is sketched schematically in Fig. 
4. Typically, models of evolutionary adaptation focus ei- 
ther on sequence evolution or network structure alone, 
and hence imply a huge loss of information as compared 
to the true multi-level and multi-scale evolutionary dy- 
namics. Artifical genomes could be an important step to- 
wards models that integrate these levels, and hence may 
lead to predictions on the effects of adaptive processes on 
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FIG. 4: Multilevel structure of the evolving genotype- 
phenotype map, involving an intermediate level of complex 
regulatory networks. 



sequence- and network evolution, and how these are re- 
lated to each other. In this section, we briefly explore an 
example of an evolutionary scenario based on an artificial 
genome, motivated by the observation that development 
is highly canalized, i.e. buffered against both intrinsic 
and environmental noise, and mutations [13] • A number 
of studies has demonstrated that stabilizing selection for 
particular phenotypes leads to emergence of this high ro- 
bustness, strongly fascilitated by the high amount of neu- 
trality contained in the fitness landscapes of complex reg- 
ulatory networks Q. Let us now define an evolutionary 
algorithm of stabilizing selection in a strongly fluctuating 
environment, based on an artificial genome. We start by 
generating an initial population of randomly assembled 
genomes; the number of bases is constrained such that 
each string contains exactly 64 genes. Next, different 
limit cycles of the associated RTNs are identified by run- 
ning network dynamics, as defined in section 2.1.1, from 
10000 different random initial state configurations. This 
process is stopped when a RTN is identified which has a 
fixed point Sf (a limit cycle of length one), and at least 5 
additional attractors; the relative weight of the basin of 
attraction leading to Sf should be small (less than 40% 
of the tested configurations). The last two criteria are 
chosen to rule out a to quick convergence of evolution- 
ary dynamics (i.e., to make the problem hard). Sf is the 
phenotype we want to stabilize, and the digit string Gf, 
that codes for its regulatory network, is the genotype we 
evolve. 

We now apply stabilizing selection as follows: 

1. Create a mutant Gf by random single base muta- 
tions, occurring with a probability p m per base. 

2. Run RTN dynamics from a random initial state, 
until an attractor is reached, otherwise stop after 
200 iterations. 

3. If dynamics has converged to Sf, keep Gf, other- 
wise keep Gf. 

4. For the next generation, iterate from (1). 
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FIG. 5: Example of an evolutionary run. Left: Evolution of 
the robustness Rf against fluctuations in initial conditions. 
Right: Evolution of the mutational robustness R m (thick line: 
cumulative average). 




FIG. 6: Left: evolution of the outdegree-distribution, right: 
evolution of the indegree-distribution in the same evolution- 
ary run as in Fig. 5. 



We note that we disregard mutations of promoter sites, 
as well as mutation leading to new "genes" , to avoid com- 
plications resulting from a varying genome size. Notice 
that, in step (2), we test only one initial configurations, 
corresponding to the fact that biological organisms are 
tested only against the environment they face at the cur- 
rent generation. Robustness against fluctuations, i.e. the 
capacity to stabilize the phenotype in a large number of 
possible environments, is measured by running RTN dy- 
namics for Gf (Gf) for a larger set Z of initial configu- 
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FIG. 7: Number of base exchanges during evolution for differ- 
ent positions on the genome, averaged over regions contain- 
ing 100 bases each. The brightness in grayscale indicates the 
number of bases exchanges. Shaded lines running from left to 
right indicate conserved regions. 



rations (e.g. 10 4 random initial states). Then 



Rf(t) 



z f (t) 
z 



(14) 



defines the robustness against fluctuations, where Zf(i) 
is the fraction of initial states that lead to Sf at gen- 
eration t. A second measure of robustness is associated 
to the capacitance to buffer the system against disad- 
vantageous mutations (mutational robustness R m , 0]). 
At each generation we measure the number of accepted 
mutants P a in the previous P generations, and define 



Rm (t) 



Pgjt) 
P ' 



(15) 



If P a , and hence R m increases with t, this indicates 
restructuring of the genome such that the probability of 
neutral or advantageous mutations with respect to Sf 
has increased. Fig. 5 shows both quantities in a typical 
evolutionary run. Both Rf and R m increase rapidly, how- 
ever, exhibiting considerable fluctuations. In particular, 
R f exhibits an interesting intermittent dynamics reminis- 
cent of a punctuated equilibrium [3| , indicating metasta- 



bility of the evolutionary dynamics. In fact, in all evolu- 
tionary runs we studied Rf and R m could be stabilized 
only over a finite number of generations, as indicated in 
Fig. 5 by the sharp decrease of both quantities around 
t = 1500. Rf and R m are positively correlated, similar to 
the results reported in The evolutionary instability 
is an inevitable consequence of the fact that Sf, in each 
generation, is tested only against a very limited set of 
mutations and environments. The artificial genome now 
allows us to trace the effects of this non-trivial evolution- 
ary dynamics on both network and sequence structure. 

Figure 6 shows the evolution of the distributions of 
regulatory input and output numbers per gene, in the 
same evolutionary run as shown in Fig. 5 with regards 
to adaptation dynamics. Evidently, considerable reorga- 
nization of network structure is taking place: while the 
indegree-distribution tends to become narrower, the peak 
of the outdegree-distribution is shifted towards larger k. 
However, these trends are not particularly pronounced, 
probably due to the small genome size applied (N = 64 
genes). Interestingly, sequence information turns out to 
be more informative for generating hypotheses how ro- 
bustness evolves. Figure 7 shows the number of base ex- 
changes during evolution for different positions on the 
genome. At each generation, the cumulative number 
of base changes in successive slices of 100 digits on the 
genome string during all previous generations was moni- 
tored. Different gray shades in Fig. 7, that are main- 
tained over the whole evolutionary run, indicate that 
there are conserved regions, while in other regions base 
changes accumulate more rapidly. While we do not yet 
have a conclusive explanation for this observation, a ten- 
tative hypothesis would be that the conserved regions 
encode binding sites of genes that regulate the invariant 
"core dynamics" of the phenotype, while the regions with 
more frequent base substitutions are responsible for sta- 
bilizing it against fluctuations, or contain mostly neutral 
mutations. A detailed analysis of the correlations be- 
tween sequence- and network evolution, which goes be- 
yond the scope of our present study, may shed more light 
on this problem. 
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